CROSS-AGE AND CROSS-SITE DOMAIN SHIFT IMPACTS ON DEEP LEARNING-BASED WHITE MATTER FIBER ESTIMATION IN NEWBORN AND BABY BRAINS

Deep learning models have shown great promise in estimating tissue microstructure from limited diffusion magnetic resonance imaging data. However, these models face domain shift challenges when test and train data are from different scanners and protocols, or when the models are applied to data with inherent variations such as the developing brains of infants and children scanned at various ages. Several techniques have been proposed to address some of these challenges, such as data harmonization or domain adaptation in the adult brain. However, those techniques remain unexplored for the estimation of fiber orientation distribution functions in the rapidly developing brains of infants. In this work, we extensively investigate the age effect and domain shift within and across two different cohorts of 201 newborns and 165 babies using the Method of Moments and fine-tuning strategies. Our results show that reduced variations in the microstructural development of babies in comparison to newborns directly impact the deep learning models’ cross-age performance. We also demonstrate that a small number of target domain samples can significantly mitigate domain shift problems.


INTRODUCTION
The human brain undergoes notable changes during development, particularly in white matter tracts that modulate cognitive and motor functions [1].Accurately estimating these fibers is crucial for understanding developmental patterns and detecting abnormalities.Advances in diffusion magnetic resonance imaging (dMRI) have provided unprecedented insights into the human brain microstructure.Traditional methods, such as Constrained Spherical Deconvolution (CSD) [2] and Multi-Shell Multi-Tissue Constrained Spherical Deconvolution (MSMT-CSD) [3], have been employed to ★ H. Kebiri and M. Bach Cuadra -Equal contribution.reconstruct fiber orientation distribution functions (FODs) as a proxy to the underlying microstructure.These methods often require a large number of diffusion measurements and/or multiple  values, making them less feasible for uncooperative young subjects.Recently, deep learning (DL) on large datasets has allowed precise FOD estimation [4]- [6] with as few as six diffusion samples from developing brains [7].
While DL can offer significant scanning time reduction, it is particularly faced with domain-shift problems.Such shifts can be attributed to several factors, from biological differences [8] such as age or pathologies [1] to imaging variations in protocols and scanner types (brand or field strength) [9].
Data harmonization has been used for reducing variability across sites while preserving data integrity [10].The dominant dMRI method operating at the signal level is the Rotation Invariant Spherical Harmonics (RISH) [11] that harmonizes dMRI data without model dependency but requires similar acquisition protocols and site-matched healthy controls.Deep learning techniques offer solutions to nonlinear harmonization but risk overfitting and require extensive training data, potentially altering pathological information [12].The Method of Moments (MoM) [13], which aligns diffusion-weighted imaging (DWI) features via spherical moments, stands out for its directionality preservation and independence from matched acquisition protocol or extensive training.Therefore, MoM presents a potentially beneficial approach for addressing the domain shift challenges.
Furthermore, domain adaptation (DA) methods have been used to address domain shifts in medical imaging [14].Supervised DA, particularly fine-tuning (FT) models with pretrained weights on source domain data, is a common method, often augmented with advanced, more targeted techniques [15].Semi-supervised DA methods, which leverage a mix of labeled and unlabeled data, can also effectively bridge domain shifts.However, both semi-supervised and unsupervised DAs face challenges in the case of significant anatomical differences, such as those between infants and neonates, where the assumption of feature space similarity may not hold.This paper investigates the domain shift effects in a DL method [7] for white matter FOD estimation in the newborn and baby populations.Our goal is to provide a detailed examination of the challenges associated with domain shifts, particularly age-related variations between these young cohorts.We propose possible solutions and emphasize the need for robust frameworks that can cope with the unique variability present in the developing brain.

Data Processing
We used dMRI data from the 3 rd release of the Developing Human Connectome Project (dHCP) [16] and the Baby Connectome Project (BCP) [17].The dHCP dataset includes 783 subjects from 20-44 post-menstrual weeks, acquired using a 3T Philips scanner and a multi-shell sequence ( ∈ {0, 400, 1000, 2600} s/mm 2 ).Its dMRI data release has been preprocessed by SHARD [18], with a final data resolution of 1.5 3 mm 3 .White Matter and Brainstem masks, which dHCP also provides, were registered to the dMRI data space and combined with voxels of fractional anisotropy (FA) > 0.3 to produce the final white matter (WM) mask.The BCP dataset comprises 285 subjects from 0-5 years, scanned using a 3T Siemens scanner with a different multi-shell protocol ( ∈ {500, 1000, 1500, 2000, 2500, 3000} s/mm 2 ).Denoising and bias, motion, and distortion corrections [19] also yielded a 1.5 3 mm 3 resolution for the BCP dMRI data.The final WM mask was established using an OR operation among STAPLE [20]-generated WM mask, FA > 0.4, and voxels with FA > 0.15 and mean diffusivity (MD) > 0.0011.We also computed the mean FA value within the white matter mask of each subject to analyze the relationship between age and FA value.

Model
As illustrated in Fig. 1, our model's workflow is divided into three stages: initial training on a source dataset, followed by separate processes of either fine-tuning or MoM harmonization on the target dataset with varying number of subjects.During inference, harmonized data is tested using the originally trained model to evaluate the effectiveness of MoM, while the fine-tuned model is applied to the original target data to assess the improvements in FOD estimation.

Backbone Model
We employed the U-Net-like network [7] as the backbone for our experiments.Its proficiency lies in estimating accurate FODs from dMRI data with six diffusion directions with an extensive field of view (FoV) of 16 × 16 × 16, and its demonstrated accurate results when applied to dHCP newborns.
We applied the MSMT-CSD [3] using all measurements (i.e., 300 diffusion directions for dHCP and 151 directions for BCP) to generate ground-truth (GT) FODs for training and evaluation.To ensure a representative sample, subjects were randomly selected from the datasets based on the desired age range and number of subjects required by each experiment detailed in Sections 2.3 and 2.4.For each subject, we processed the diffusion signal by selecting six optimal gradient directions, normalizing, projecting onto the spherical harmonic (SH)-basis, and cropping to 16 3 patches as in [7].
For each experiment, we trained the backbone network for 1000 epochs using the Adam optimizer [21] with an initial learning rate of 5×10 −5 , weight decay of 1×10 −3 , and batches of 35.We used a dropout of 0.1 to prevent overfitting.Model selection was based on the lowest mean squared error (MSE) between predicted and GT FODs in the validation set.

Methods for Addressing Domain Shifts
We explore two primary data harmonization and domain adaptation strategies to handle domain shifts.
Data Harmonization using Method of Moments MoM [13] was employed to harmonize DWI data across sites by aligning the mean and variance using linear mapping functions  ={ , } () =  + .This approach adjusts each voxel's DWI signal  to the reference site's characteristics.Median images of these moments, smoothed with a Gaussian filter to mitigate artifacts, were computed from the six optimal gradient directions and used to derive the harmonization parameters  and .

Domain Adaptation using Fine-Tuning
This process involved knowledge transfer and additional training of the model on the target domain data.We conducted fine-tuning over 100 epochs, with a reduced learning rate of 5 × 10 −6 and smaller batches of 10.

Implementation Details and Code Availability
Training and fine-tuning were performed on an NVIDIA RTX 3090 GPU.We use TensorFlow 2.11 for our DL framework and MATLAB R2022b for MoM harmonization.Our code is publicly available at http://github.com/Medical-Ima ge-Analysis-Laboratory/dl_fiber_domain_shift.

Intra-Site Age-Related Evaluation
We assessed baseline performance on the dHCP and BCP datasets, selecting 100 subjects from specified age ranges (dHCP: 29.3-44.3post-menstrual weeks; BCP: 1.5-60 postnatal months), and allocated them into training, validation, and testing sets (70/15/15).The backbone model was trained and tested on these splits, respectively for BCP and dHCP.GT consistency was evaluated by processing two mutually exclusive subsets of the full measurements with MSMT-CSD (referred to as Gold Standards, GS), as in [7].
To investigate age-related shifts within each site, we conducted age-specific training and cross-testing.The dHCP dataset was split into two age groups: [26.7, 35.0] and [40.0, 44.4] weeks, and the BCP dataset into [0.5, 11] and [20,36] months, denoted as young and old, respectively.Each group consisted of 60 subjects, split into 40/10/10 partitions for training, validation, and testing, respectively.Fine-tuning was also performed across different age groups within dHCP using 5 subjects from the corresponding target age group.

Inter-Site Experiments
To address both age-related and cross-site domain shifts, we conducted cross-testing between dHCP and BCP with respective baseline models from Section 2.3.We also evaluated how varying subject numbers in the target training dataset (1, 2, 5, and 10 subjects) affect the performance of MoM harmonization and fine-tuning.Furthermore, an ablation study involved training a model from scratch on 10 target dataset subjects to verify performance gains beyond target set familiarity.

Evaluation Metrics
We quantitatively assessed FOD estimation accuracy using metrics as per [7]: Agreement Rate (AR) for peak count consistency, Angular Error (AE) for angular discrepancy between predicted and GT FODs, and Apparent Fiber Density (AFD) from [22] to evaluate fiber density.

Intra-Site Experiments
The DL metrics are first compared to the GT consistency (GS in Table 1) for dHCP and BCP.As previously reported in [23], single-fiber predictions show good agreement, but performance decreases with multiple fibers for both datasets.This is more pronounced in three-fiber cases, which exhibit low DL performance as also reported in [7].We therefore not consider 3-fiber metrics in subsequent experiments.
Moving to age-specific comparisons, we observed different patterns between younger and older age groups (denoted as "y" and "o", respectively), for dHCP and BCP datasets.Fig. 2 illustrates these differences, revealing that age-related effects in BCP are less marked compared to dHCP.For instance, the difference in single-fiber ARs and AFD between DL y y and DL o o is higher within dHCP than BCP, approximately 14% and 21% for AFD error, respectively.
Within each age group, we see similar stability in BCP compared to dHCP when training on young and testing on old subjects or vice versa.This consistency could be due to rapid development and white matter changes in the first months of life, as opposed to slower white matter development in later periods [24].To validate this hypothesis, we explored the average white matter FA in our cohorts (Fig. 3), which indeed shows a significant increase in dHCP but a plateau in the BCP cohort.In general, AE seems to be less prone to age effect, except when training on older dHCP subjects and testing on younger ones (DL y y ), where there is also a more pronounced decline in AR.Finally, fine-tuning on five dHCP subjects consistently reduces error rates, especially for AFD error.

Inter-Site Experiments
Given the high number of domain shifts (scanner, protocol, age) and the low agreement of GS/DL of multiple fiber depictions (Table 1), we compare the cross-site results on single fiber populations only.Inter-site performance, depicted in Fig. 5, shows the capability of the DL model to generalize across different datasets (as shown in Fig. 4).AR when tested on dHCP (as shown in Fig. 5 (a)) displays a marked increase with fine-tuning on one and two subjects.However, increments from 2 to 5 and 5 to 10 subjects offer only marginal gains in AR.Fine-tuning exhibits an advantage over MoM particularly when transitioning from BCP to dHCP.In both directions, AE presents a slight improvement when more target subjects are incorporated into fine-tuning the model, although the extent of this improvement is not much pronounced (2-3°).Moreover, the MoM harmonization is less sensitive to the number of subjects used, both for AE and AR.As for the ablation study, where the model was In summary, the inter-site experiments show that refining the DL-based fiber estimation pipeline using few target domain subjects in fine-tuning or MoM outperforms direct cross-testing and can make the accuracy closer to direct testing in some cases such as AR in dHCP testing shown in Fig. 5 (a).This improvement is significantly more visible for transfers from BCP to dHCP than vice versa, suggesting some distinct dynamics at play when adapting a model from a dataset with pronounced age-related shifts (dHCP) to one with more gradual changes (BCP), and less so in the reverse direction.

CONCLUSION
This work has demonstrated that even a small number of target data samples can be instrumental in overcoming domain shifts encountered in white matter fiber estimation with deep learning.Through the application of fine-tuning and to a lesser extent the MoM harmonization strategies, models have shown improved performance in estimating the FODs in developing brains in both cross-age and cross-site and age settings.Moreover, we observed that the lower variations in the microstructural development of babies compared to newborns, have a direct influence on the performance of the DL models in the cross-age experiments.Such findings highlight the importance of tailoring DL models to account for the unique developmental stages of pediatric populations.

COMPLIANCE WITH ETHICAL STANDARDS
This retrospective research study used open-source human subject data from the Developing Human Connectome Project and the Baby Connectome Project, respectively, where ethical approval was not required per the data licenses.

Fig. 1 .
Fig. 1.Diagram of the workflow separated into: (1) initial model training on the source dataset  train , (2) MoM harmonization and model fine-tuning applied independently on the target training dataset  train with varying subject numbers ({1,2,5,10}), and (3) inference where the original model assesses harmonized data  ′ infr , and the fine-tuned model evaluates the original target data  infr .

Fig. 2 .Fig. 3 .Fig. 4 .
Fig. 2. Comparative intra-site performance of DL models across age-specific training in dHCP (top) and BCP (bottom), showing AR and AE under 1/2-F configurations alongside the AFD Error.DL a b denotes models trained on "a" and tested on "b"; further fine-tuned on "b" when followed by FT .

Fig. 5 .
Fig.5.Inter-site performance of BCP-trained models tested on dHCP (a, b), and dHCP-trained models tested on BCP (c, d), comparing MoM and FT methods using varying subjects (1, 2, 5, 10) from the target domain under single-fiber configuration, with cross-testing and self-testing serving as lowerand upper-performance bounds ("Control"), respectively.